FIGURE 5.7
Overview of the algorithm proposed in [5].
In summary, this paper's contributions are as follows: (1) new kernels for efficient and accurate integer-only GELU and Softmax, in which both functions are approximated by lightweight second-order polynomials that can be evaluated with integer-only arithmetic; (2) integer-only LayerNorm computation, obtained by leveraging a known algorithm for the integer calculation of the square root [49]; and (3) fully integer-only quantization of language models built on the proposed approximations of GELU and Softmax together with the integer-only LayerNorm.
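As a concrete illustration of contribution (1), the following floating-point sketch shows the functional form of the second-order polynomial GELU approximation. The coefficients a ≈ −0.2888 and b ≈ −1.769 are the values reported for this approximation and are reproduced here only for illustration; the actual kernel evaluates the same polynomial with integer arithmetic and scaling factors, which are omitted, and the function names are ours.

```python
import math

# Coefficients of the second-order polynomial fit to erf used by the
# integer-only GELU kernel; reproduced here for illustration only.
A, B = -0.2888, -1.769

def poly_erf(x: float) -> float:
    """Second-order polynomial approximation of erf(x)."""
    sign = 1.0 if x >= 0 else -1.0
    x_clipped = min(abs(x), -B)      # saturate outside the fitted range
    return sign * (A * (x_clipped + B) ** 2 + 1.0)

def poly_gelu(x: float) -> float:
    """GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))), with erf replaced by poly_erf."""
    return 0.5 * x * (1.0 + poly_erf(x / math.sqrt(2.0)))

# Sanity check against the exact GELU.
for v in (-3.0, -1.0, 0.0, 1.0, 3.0):
    exact = 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))
    print(f"x={v:+.1f}  exact={exact:+.4f}  approx={poly_gelu(v):+.4f}")
```

Because the polynomial is only second order, it can be evaluated with a handful of multiplications and additions once inputs and coefficients are expressed in fixed point, which is what makes the kernel integer-only.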
5.5 Toward Efficient Post-Training Quantization of Pre-Trained Language Models
Bai et al. [5] propose MREM, which aims to improve the performance of post-training quantization for language models while retaining the training efficiency, low memory overhead, and data accessibility that post-training quantization offers. An overview of the algorithm proposed in [5] is presented in Fig. 5.7. As can be seen, the full-precision and quantized models are first partitioned into multiple modules, which are then placed on different computing devices. Each module samples input tensors from its own input queue, so it can be trained locally without waiting for its predecessors. Moreover, teacher forcing is applied to mitigate the propagation of reconstruction errors into the quantized modules; a minimal sketch of this training scheme is given below.
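The sketch below assumes a PyTorch-style setting: the transformer layers are split into modules, and each quantized module is trained on its own device against its frozen full-precision counterpart, consuming cached full-precision hidden states from an input queue (teacher forcing). All class and function names are hypothetical, and only the module's final output is matched here; the full MREM objective over every layer in a module is given by Eq. (5.10) in Section 5.5.1.

```python
import torch
from torch import nn


def partition(layers: nn.ModuleList, num_modules: int) -> list:
    """Split consecutive transformer layers into num_modules modules."""
    per_module = (len(layers) + num_modules - 1) // num_modules
    return [nn.Sequential(*layers[i:i + per_module])
            for i in range(0, len(layers), per_module)]


class ModuleWorker:
    """Trains one quantized module locally against its full-precision counterpart."""

    def __init__(self, fp_module, q_module, device, input_queue):
        self.fp = fp_module.to(device).eval()   # frozen full-precision teacher
        self.q = q_module.to(device)            # quantized student module
        self.queue = input_queue                # cached full-precision hidden states
        self.device = device
        self.opt = torch.optim.Adam(self.q.parameters(), lr=1e-4)

    def step(self) -> float:
        # Teacher forcing: the quantized module consumes *full-precision*
        # hidden states from its queue, so reconstruction errors made by
        # earlier quantized modules do not propagate into this one, and no
        # worker has to wait for its predecessors.
        h = self.queue.get().to(self.device)
        with torch.no_grad():
            target = self.fp(h)
        loss = torch.mean((self.q(h) - target) ** 2)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```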
5.5.1 Module-Wise Reconstruction Error Minimization
First, the language model is partitioned into multiple modules, each consisting of multiple transformer layers. Module-wise reconstruction error minimization (MREM) is then used to optimize each module's weights and quantization parameters, which permits sufficient optimization. Specifically, given a language model with $L$ transformer layers, the embedding layers, and the classification head, the model is partitioned into $N$ modules. Suppose the $n$-th module contains $p$ transformer layers; then it comprises the layers $[l_j, l_{j+1}, l_{j+2}, \ldots, l_{j+p-1}]$, with $l_j$ being the first layer of this module. MREM minimizes the joint reconstruction error between each intermediate output $\hat{f}_{l_i}$ of the quantized $n$-th module and its full-precision counterpart $f_{l_i}$, as follows:
$$\mathcal{L}_n = \sum_{i=j}^{j+p-1} \left\lVert \hat{f}_{l_i} - f_{l_i} \right\rVert^2 . \tag{5.10}$$
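A short sketch of Eq. (5.10) follows, again in a PyTorch-style setting where the quantized and full-precision modules hold the same consecutive layers $l_j, \ldots, l_{j+p-1}$ and both receive the same full-precision input (teacher forcing); the function name and the summed squared error are illustrative rather than the reference implementation of [5].

```python
import torch
from torch import nn


def mrem_loss(q_module: nn.Sequential, fp_module: nn.Sequential,
              h_in: torch.Tensor) -> torch.Tensor:
    """Joint reconstruction error L_n of Eq. (5.10) for one module.

    The loss sums the squared differences between the quantized output and
    the full-precision output of every transformer layer in the module.
    """
    loss = h_in.new_zeros(())
    h_q = h_fp = h_in
    for q_layer, fp_layer in zip(q_module, fp_module):
        h_q = q_layer(h_q)                      # quantized forward pass
        with torch.no_grad():
            h_fp = fp_layer(h_fp)               # full-precision forward pass
        loss = loss + torch.sum((h_q - h_fp) ** 2)
    return loss
```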